Fanyi Pu

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

May 13, 2026

Demystifing Video Reasoning

Mar 17, 2026

Scaling Spatial Intelligence with Multimodal Foundation Models

Nov 17, 2025

Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients

May 03, 2025

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Jan 23, 2025

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Jul 17, 2024

WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning

May 06, 2024

OtterHD: A High-Resolution Multi-modality Model

Nov 07, 2023

MIMIC-IT: Multi-Modal In-Context Instruction Tuning

Jun 08, 2023